Extending Compositional Attention Networks for Social Reasoning in Videos
We propose a novel deep architecture for the task of reasoning about social
interactions in videos. We leverage the multi-step reasoning capabilities of
Compositional Attention Networks (MAC), and propose a multimodal extension
(MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level
fusion of input modalities (visual, auditory, text) over multiple reasoning
steps, by use of a temporal attention mechanism. We then combine MAC-X with
LSTMs for temporal input processing in an end-to-end architecture. Our ablation
studies show that the proposed MAC-X architecture can effectively leverage
multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the
task of Social Video Question Answering in the Social IQ dataset and obtain a
2.5% absolute improvement in binary accuracy over the current
state of the art.
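The mid-level fusion step described above can be sketched as follows. This is a toy illustration in plain NumPy, not the paper's learned MAC-X cell: the control vector, the dot-product attention, and the averaging-based fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(control, seq):
    """Attend over the time axis of one modality. control: (d,), seq: (T, d)."""
    weights = softmax(seq @ control)   # one attention weight per time step
    return weights @ seq               # weighted read, shape (d,)

def fusion_step(control, visual, audio, text):
    """One reasoning step: read each modality with temporal attention,
    then fuse the three reads mid-level (here: a simple average)."""
    reads = [temporal_attention(control, m) for m in (visual, audio, text)]
    return np.mean(reads, axis=0)      # fused knowledge vector, shape (d,)

rng = np.random.default_rng(0)
d, T = 8, 5
control = rng.normal(size=d)
fused = fusion_step(control,
                    rng.normal(size=(T, d)),   # visual features over time
                    rng.normal(size=(T, d)),   # audio features over time
                    rng.normal(size=(T, d)))   # text features over time
```

In the full architecture this step would be applied recurrently, with the control vector updated between reasoning steps.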
Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling
The study of speech disorders can benefit greatly from time-aligned data.
However, audio-text mismatches in disfluent speech cause rapid performance
degradation for modern speech aligners, hindering the use of automatic
approaches. In this work, we propose a simple and effective modification of
alignment graph construction of CTC-based models using Weighted Finite State
Transducers. The proposed weakly-supervised approach alleviates the need for
verbatim transcription of speech disfluencies for forced alignment. During the
graph construction, we allow the modeling of common speech disfluencies, i.e.
repetitions and omissions. Further, we show that by assessing the degree of
audio-text mismatch through the use of Oracle Error Rate, our method can be
effectively used in the wild. Our evaluation on a corrupted version of the
TIMIT test set and the UCLASS dataset shows significant improvements,
particularly for recall, achieving a 23-25% relative improvement over our
baselines. Comment: Interspeech 202
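The kind of graph modification described can be sketched in plain Python rather than an actual WFST library. The arc format, the costs, and the use of a self-loop for repetitions and an epsilon arc for omissions are illustrative assumptions, not the paper's exact construction:

```python
def build_disfluency_graph(words, omit_cost=1.0, repeat_cost=0.5):
    """Linear word graph with extra arcs for common disfluencies.

    Arcs are (src_state, dst_state, label, cost). State i sits before
    word i; the canonical path emits each word once at zero cost.
    """
    arcs = []
    for i, w in enumerate(words):
        arcs.append((i, i + 1, w, 0.0))              # canonical transition
        arcs.append((i, i + 1, "<eps>", omit_cost))  # omission: skip the word
        arcs.append((i, i, w, repeat_cost))          # repetition: re-emit the word
    return arcs

arcs = build_disfluency_graph(["the", "cat", "sat"])
```

A decoder searching this graph can then align disfluent audio against a non-verbatim transcript, paying a small cost whenever it takes a repetition or omission arc.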
Investigating Personalization Methods in Text to Music Generation
In this work, we investigate the personalization of text-to-music diffusion
models in a few-shot setting. Motivated by recent advances in the computer
vision domain, we are the first to explore the combination of pre-trained
text-to-audio diffusers with two established personalization methods. We
experiment with the effect of audio-specific data augmentation on the overall
system performance and assess different training strategies. For evaluation, we
construct a novel dataset with prompts and music clips. We consider both
embedding-based and music-specific metrics for quantitative evaluation, as well
as a user study for qualitative evaluation. Our analysis shows that similarity
metrics are in accordance with user preferences and that current
personalization approaches tend to learn rhythmic music constructs more easily
than melody. The code, dataset, and example material of this study are openly
available to the research community. Comment: Submitted to ICASSP 2024, Examples at https://zelaki.github.io
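As a rough sketch of one family of personalization methods alluded to above (textual-inversion-style approaches), one can optimize a single pseudo-token embedding against embeddings of the few-shot clips. The toy L2 objective and gradient-descent loop below are stand-ins for the study's actual diffusion training, and all names are hypothetical:

```python
import numpy as np

def learn_concept_embedding(clip_embs, lr=0.1, steps=200, seed=0):
    """Fit one pseudo-token embedding by gradient descent so it minimizes
    the L2 distance to the few-shot clip embeddings (a stand-in for the
    real denoising objective used when personalizing a diffusion model)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(scale=0.01, size=clip_embs.shape[1])
    target = clip_embs.mean(axis=0)      # the optimum of this toy objective
    for _ in range(steps):
        grad = 2.0 * (v - target)        # d/dv of ||v - target||^2
        v = v - lr * grad
    return v

clips = np.random.default_rng(1).normal(size=(4, 16))  # 4 few-shot clip embeddings
token = learn_concept_embedding(clips)
```

In an actual system the learned embedding would be inserted into the text encoder's vocabulary so that prompts containing the new pseudo-token condition generation on the personalized concept.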
The relationship between musical training and the processing of audiovisual correspondences: Evidence from a reaction time task
Numerous studies have reported both cortical and functional changes in visual, tactile, and auditory brain areas of musicians, which have been attributed to long-term, training-induced neuroplasticity. Previous investigations have reported advantages for musicians in multisensory processing at the behavioural level; however, multisensory integration in tasks requiring higher-level cognitive processing has not yet been studied extensively. Here, we investigated the association between musical expertise and the processing of audiovisual crossmodal correspondences in a decision reaction-time task. The visual display varied in three dimensions (elevation, symbolic magnitude, and non-symbolic magnitude), while the auditory stimulus varied in pitch. Congruency was based on a set of newly learned abstract rules: “the higher the spatial elevation, the higher the tone”, “the more dots presented, the higher the tone”, and “the higher the number presented, the higher the tone”. Accuracy and reaction times were recorded. Musicians were significantly more accurate in their responses than non-musicians, suggesting an association between long-term musical training and audiovisual integration. Contrary to our hypothesis, no differences in reaction times were found. The musicians’ advantage in accuracy also extended to rule-based congruency between otherwise unrelated stimuli (pitch-magnitude pairs), suggesting an advantage in processes requiring higher-order cognitive functions. These results suggest an interaction between implicit and explicit processing, as reflected in reaction times and accuracy, respectively, and support the notion that accuracy and latency measures may reflect different processes.